SF Salaries

The purpose of this project is to demonstrate the pandas library data manipulation operations. In order to do that, we'll use San Francisco dataset available on (https://www.kaggle.com/kaggle/sf-salaries). This dataset contains annual data of employees from 2011 to 2014.

Importing package



In [3]:

    
import pandas as pd

Read CSV into dataframe 'sal'



In [4]:

    
sal = pd.read_csv('Salaries.csv')

Head of dataframe



In [5]:

    
sal.head()









    Out[5]:






  
    
      
      Id
      EmployeeName
      JobTitle
      BasePay
      OvertimePay
      OtherPay
      Benefits
      TotalPay
      TotalPayBenefits
      Year
      Notes
      Agency
      Status
    
  
  
    
      0
      1
      NATHANIEL FORD
      GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
      167411.18
      0.00
      400184.25
      NaN
      567595.43
      567595.43
      2011
      NaN
      San Francisco
      NaN
    
    
      1
      2
      GARY JIMENEZ
      CAPTAIN III (POLICE DEPARTMENT)
      155966.02
      245131.88
      137811.38
      NaN
      538909.28
      538909.28
      2011
      NaN
      San Francisco
      NaN
    
    
      2
      3
      ALBERT PARDINI
      CAPTAIN III (POLICE DEPARTMENT)
      212739.13
      106088.18
      16452.60
      NaN
      335279.91
      335279.91
      2011
      NaN
      San Francisco
      NaN
    
    
      3
      4
      CHRISTOPHER CHONG
      WIRE ROPE CABLE MAINTENANCE MECHANIC
      77916.00
      56120.71
      198306.90
      NaN
      332343.61
      332343.61
      2011
      NaN
      San Francisco
      NaN
    
    
      4
      5
      PATRICK GARDNER
      DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)
      134401.60
      9737.00
      182234.59
      NaN
      326373.19
      326373.19
      2011
      NaN
      San Francisco
      NaN

Finding some basic info about the data through the info() function



In [6]:

    
sal.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id                  148654 non-null int64
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

Average BasePay



In [7]:

    
sal['BasePay'].mean()









    Out[7]:





66325.44884050643

Highest amount of overtime pay in the dataset



In [8]:

    
sal['OvertimePay'].max()









    Out[8]:





245131.88

What is the job title of JOSEPH DRISCOLL?



In [9]:

    
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']









    Out[9]:





24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

How much does JOSEPH DRISCOLL make (including benefits)?



In [10]:

    
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']









    Out[10]:





24    270324.91
Name: TotalPayBenefits, dtype: float64

The name of highest paid person (including benefits)?



In [11]:

    
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]









    Out[11]:






  
    
      
      Id
      EmployeeName
      JobTitle
      BasePay
      OvertimePay
      OtherPay
      Benefits
      TotalPay
      TotalPayBenefits
      Year
      Notes
      Agency
      Status
    
  
  
    
      0
      1
      NATHANIEL FORD
      GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY
      167411.18
      0.0
      400184.25
      NaN
      567595.43
      567595.43
      2011
      NaN
      San Francisco
      NaN

The name of lowest paid person



In [12]:

    
sal[sal['TotalPayBenefits']  == sal['TotalPayBenefits'].min()]









    Out[12]:






  
    
      
      Id
      EmployeeName
      JobTitle
      BasePay
      OvertimePay
      OtherPay
      Benefits
      TotalPay
      TotalPayBenefits
      Year
      Notes
      Agency
      Status
    
  
  
    
      148653
      148654
      Joe Lopez
      Counselor, Log Cabin Ranch
      0.0
      0.0
      -618.13
      0.0
      -618.13
      -618.13
      2014
      NaN
      San Francisco
      NaN

The average (mean) BasePay of all employees per year? (2011-2014)?



In [13]:

    
sal.groupby('Year').mean()['BasePay']









    Out[13]:





Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

Number of unique job titles



In [14]:

    
sal['JobTitle'].nunique()









    Out[14]:





2159

Top 5 most common jobs



In [15]:

    
grouped = sal.groupby('JobTitle').count()
top5 = grouped.sort_values(by='Id',  ascending=False)[:5]
top5['Id']









    Out[15]:





JobTitle
Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: Id, dtype: int64

Job titles represented by only one person



In [16]:

    
copied_sal = sal[sal['Year'] == 2013]
group = copied_sal.groupby('JobTitle').count()
count = group[group['Id'] == 1]
count.count()['Id']









    Out[16]:





202

How many people have the word Chief in their job title?



In [17]:

    
def find_chief(job_title):    
    if 'chief' in job_title.lower().split():
        return True
    else:
        return False

sal = pd.read_csv('Salaries.csv')

sum(sal['JobTitle'].apply(lambda x: find_chief(x)))









    Out[17]:





477

Is there a correlation between length of the Job Title string and Salary?



In [18]:

    
sal['title_len'] = sal['JobTitle'].apply(len)

sal[['title_len', 'TotalPayBenefits']].corr()









    Out[18]:






  
    
      
      title_len
      TotalPayBenefits
    
  
  
    
      title_len
      1.000000
      -0.036878
    
    
      TotalPayBenefits
      -0.036878
      1.000000

	Id	EmployeeName	JobTitle	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year	Notes	Agency	Status
0	1	NATHANIEL FORD	GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY	167411.18	0.00	400184.25	NaN	567595.43	567595.43	2011	NaN	San Francisco	NaN
1	2	GARY JIMENEZ	CAPTAIN III (POLICE DEPARTMENT)	155966.02	245131.88	137811.38	NaN	538909.28	538909.28	2011	NaN	San Francisco	NaN
2	3	ALBERT PARDINI	CAPTAIN III (POLICE DEPARTMENT)	212739.13	106088.18	16452.60	NaN	335279.91	335279.91	2011	NaN	San Francisco	NaN
3	4	CHRISTOPHER CHONG	WIRE ROPE CABLE MAINTENANCE MECHANIC	77916.00	56120.71	198306.90	NaN	332343.61	332343.61	2011	NaN	San Francisco	NaN
4	5	PATRICK GARDNER	DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)	134401.60	9737.00	182234.59	NaN	326373.19	326373.19	2011	NaN	San Francisco	NaN